library(ngsReports)
library(magrittr)
library(scales)
library(pander)
library(tidyverse)
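# Load the FastQC summaries for the raw (pre-demultiplexing) libraries
rawFqc <- list.files("../0_rawData/FastQC/", pattern = "zip", full.names = TRUE) %>%
  getFastqcData()
# Load the FastQC summaries for the demultiplexed samples
deMuxFqc <- list.files("../1_demux/FastQC/", pattern = "zip", full.names = TRUE) %>%
  getFastqcData()
# Sample metadata, used below to map each sample ID to its library
samples <- read_tsv("../0_rawData/samples.tsv")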
Each library was inspected for overall quality. Positions 3 & 4 of the R2 libraries FCC21WPACXX-CHKPEI13070002_L6 and FCC21WPACXX-CHKPEI13070003_L7 showed clear problems with read qualities. This was likely due to insufficient nucleotide diversity in the original sequencing run, with not enough phiX included to overcome this, particularly as all R2 reads will terminate with the restriction site.
Base qualities before demultiplexing.
The first check after demultiplexing was to ensure that reads were not assigned to multiple individuals by sabre pe. Read totals before and after demultiplexing were then compared, and the recovery rate was >90% for each library, indicating that demultiplexing was performed with a high degree of success. This process was only applied to the data from the 1996 and 2012 populations, as the Turretfield samples were provided by the collaborators already demultiplexed.
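# Read totals per raw library (R1 and R2 share a total, so keep one row per library)
rawReadTotals <- readTotals(rawFqc) %>%
  mutate(Library = str_remove_all(Filename, "_[12].fq.gz")) %>%
  distinct(Library, Total_Sequences)
# Read totals per demultiplexed sample, summed back up to the library of origin
deMuxReadTotals <- readTotals(deMuxFqc) %>%
  mutate(ID = str_remove_all(Filename, ".[12].fq.gz")) %>%
  distinct(ID, Total_Sequences) %>%
  left_join(samples, by = "ID") %>%
  group_by(Library) %>%
  summarise(DeMultiplexed = sum(Total_Sequences)) %>%
  # Drop samples with no matching library in the metadata
  filter(!is.na(Library))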
| Library | Total Sequences | DeMultiplexed | Recovery Rate |
|---|---|---|---|
| FCC21WPACXX-CHKPEI13070002_L6 | 69,729,067 | 63,767,250 | 91.45% |
| FCC21WPACXX-CHKPEI13070003_L7 | 65,481,942 | 65,031,823 | 99.31% |
| FCC229TACXX-CHKPEI13070001_L3 | 56,981,567 | 56,588,142 | 99.31% |
| FCC2GPDACXX-CHKPEI13070004_L2 | 78,411,565 | 78,043,413 | 99.53% |
| FCC2GPDACXX-CHKPEI13070005_L3 | 78,445,394 | 77,182,810 | 98.39% |
| FCC2GPDACXX-CHKPEI13070006_L4 | 70,874,412 | 69,781,770 | 98.46% |
| FCC2GPDACXX-CHKPEI13070007_L6 | 66,601,007 | 66,235,650 | 99.45% |
Read totals for each of the 1996 & 2012 samples.
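The recovery rates reported in the table are the demultiplexed read total divided by the raw read total for each library. The following is a minimal sketch of that calculation, not necessarily the code used to build the table. It assumes two data frames shaped like `rawReadTotals` and `deMuxReadTotals` above, and substitutes small hand-built stand-ins (values copied from the first two table rows) so that it runs on its own. The column names `Recovery` and `Recovery Rate` are illustrative.

```r
library(dplyr)
library(scales)

# Stand-ins for rawReadTotals and deMuxReadTotals, using the first two
# rows of the table above (hand-built here so the sketch is self-contained)
rawReadTotals <- tibble(
  Library = c("FCC21WPACXX-CHKPEI13070002_L6", "FCC21WPACXX-CHKPEI13070003_L7"),
  Total_Sequences = c(69729067, 65481942)
)
deMuxReadTotals <- tibble(
  Library = c("FCC21WPACXX-CHKPEI13070002_L6", "FCC21WPACXX-CHKPEI13070003_L7"),
  DeMultiplexed = c(63767250, 65031823)
)

# Join the two summaries by library and express recovery as a proportion
recovery <- rawReadTotals %>%
  left_join(deMuxReadTotals, by = "Library") %>%
  mutate(
    Recovery = DeMultiplexed / Total_Sequences,
    `Recovery Rate` = percent(Recovery, accuracy = 0.01)
  )

# Flag any library falling below the 90% recovery mentioned in the text
lowRecovery <- filter(recovery, Recovery < 0.9)
```

With the real objects, `lowRecovery` having zero rows would correspond to the statement that every library recovered more than 90% of its reads.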
GC content for all 1996 and 2012 samples.
Inspection of GC content showed that gc2709 and gc2700 appeared to have an exaggerated peak around 60%, whilst all other samples showed a broader spread across the range. Among the Turretfield samples, pt1125 showed an unexpected peak around 50%, indicating that this sample may contain reads from a different species. This sample should be excluded from all further analysis.
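The observations above were made by visual inspection of the GC content plots. As a complementary, purely illustrative sketch, the position and height of each sample's GC peak can also be summarised numerically. The data frame `gcCounts`, its columns, and the sample names below are hypothetical toy values, not data from this study. In practice the counts would come from the per-sequence GC content module of the FastQC reports.

```r
library(dplyr)

# Hypothetical per-sample GC distributions: one row per sample and GC% bin.
# These toy counts are invented for illustration only.
gcCounts <- tibble(
  Sample = rep(c("sampleA", "sampleB"), each = 5),
  GC_Content = rep(c(30, 40, 50, 60, 70), times = 2),
  Count = c(10, 30, 40, 15, 5,
            5, 10, 15, 60, 10)
)

# Convert counts to within-sample frequencies, then record where the
# distribution peaks and how much of the sample sits in that bin
gcSummary <- gcCounts %>%
  group_by(Sample) %>%
  mutate(Freq = Count / sum(Count)) %>%
  summarise(
    ModalGC = GC_Content[which.max(Freq)],
    PeakHeight = max(Freq)
  )
```

A sample whose `ModalGC` or `PeakHeight` sits well away from the rest of its population would be a candidate for closer inspection. Any cut-off used to define "well away" would be a judgment call and is not something the original analysis specifies.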
GC content for all Turretfield samples.
Adapter content was surprisingly high for some samples, and trimming may be appropriate. Note that this plot understates read lengths.
Adapter Content for all 1996 & 2012 samples.
Length Distribution for R1 files
Length Distribution for R2 files
Sequence content showing the presence of the restriction site (RS) in both the Turretfield and R2 samples.